Abstract


In this paper, the answer selection problem in community question answering (CQA) is regarded as an answer sequence labeling task, and a novel approach based on a recurrent architecture is proposed for this problem. Our approach first applies convolutional neural networks (CNNs) to learn the joint representation of a question-answer pair, and then uses the joint representations as input to a long short-term memory (LSTM) network, which learns the answer sequence of a question in order to label the matching quality of each answer. Experiments conducted on the SemEval 2015 CQA dataset show the effectiveness of our approach.


1 Introduction 


Answer selection in community question answering (CQA), which recognizes high-quality responses to obtain useful question-answer pairs, is greatly valuable for knowledge base construction and information retrieval systems. To recognize matching answers for a question, typical approaches model the semantic matching between question and answer by exploring various features (Wang et al., 2009a; Shah and Pomerantz, 2010). Some studies exploit syntactic tree structures (Wang et al., 2009b; Moschitti et al., 2007) to measure the semantic matching between question and answer. However, these approaches require high-quality data and various external resources, which may be quite difficult to obtain. To take advantage of a large quantity of raw data, deep learning based approaches (Wang et al., 2010; Hu et al., 2013) have been proposed to learn the distributed representation of a question-answer pair directly. One disadvantage of these approaches is that the semantic correlations embedded in the answer sequence of a question are ignored, although they are very important for answer selection. Figure 1 shows an example of the relationships among the answers in the sequence for a given question. Intuitively, the other answers to a question are helpful for judging the quality of the current answer.


Q: Hi. anyone can suggest a good tailor shop (preferably Philippine nationality) in Qatar? i heard there's one over at Al Saad. just not sure the details... thanks!
a1: There are a lot of tailor shops, it depends on what you want!
a2: Sterling Tailors in Barwa Village, it is run by indians and sri lankans but service is good. I've seen some filipinos who are taking orders from them. Just Check it out...
a3: thanks, will def check 'em out...
a4: Oh my...they now sell Filipinos? Is there anything they don't sell? Well, apart from Guitar Hero...
a5: there's always a place for improvement, lol,.

Figure 1: An Example of the Answer Sequence for a Question. The dashed arrows depict the relationships of the answers in the sequence.


Recently, recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) (Hochreiter et al., 2001), have proved superior in various tasks (Sutskever et al., 2014; Srivastava et al., 2015), as they model both the long-term and short-term information of a sequence. There is also work on using convolutional neural networks (CNNs) to learn the representations of sentences or short texts, which achieves state-of-the-art performance on sentiment classification (Kim, 2014) and short text matching (Hu et al., 2014).


In this paper, we address the answer selection problem as a sequence labeling task, which identifies the matching quality of each answer in the answer sequence of a question. First, CNNs are used to learn the joint representation of each question-answer (QA) pair. Then the learnt joint representations are used as inputs to an LSTM to predict the quality (e.g., Good, Bad and Potential) of each answer in the answer sequence. Experiments conducted on the CQA dataset of the answer selection task in SemEval-2015¹ show that the proposed approach outperforms other state-of-the-art approaches.



2 Related Work 


Figure 2: The architecture of R-CNN 


Prior studies on answer selection generally treated this challenge as a classification problem, employing machine learning methods that rely on exploring various features to represent a QA pair. Huang et al. (2007) integrated textual features with structural features of forum threads to represent the candidate QA pairs, and used a support vector machine (SVM) to classify the candidate pairs. Beyond typical features, Shah and Pomerantz (2010) trained a logistic regression (LR) classifier with user metadata to predict the quality of answers in CQA. Ding et al. (2008) proposed an approach based on conditional random fields (CRF), which can capture contextual features from the answer sequence for the semantic matching between question and answer. Additionally, translation-based language models were also used for QA matching by translating the answer into the corresponding question (Jeon et al., 2005; Xue et al., 2008; Zhou et al., 2011). The translation-based methods suffer from the informal words and phrases in Q&A archives, and are less applicable to new domains.


In contrast to symbolic representation, Wang et al. (2010) proposed a semantic relevance model based on deep belief nets (DBN) to learn the distributed representation of a QA pair. Recently, sentence representation models based on convolutional neural networks (CNNs) have achieved successes in natural language processing (NLP) tasks. Yu et al. (2014) proposed a convolutional sentence model to identify the answer contents of a question from Q&A archives by means of distributed representations. The work of Hu et al. (2014) demonstrated that 2-dimensional convolutional sentence models can represent the hierarchical structures of sentences and capture rich matching patterns between two language objects.


1 http://alt.qcri.org/semeval2015/task3/


3 Approach 

We consider the answer selection problem in CQA as a sequence labeling task. To label the matching quality of each answer for a given question, our approach models the semantic links between successive answers, as well as the semantic relevance between question and answer. Figure 2 summarizes the recurrent architecture of our model (R-CNN). The motivation of R-CNN is to learn useful context that improves the performance of answer selection: the answer sequence is modeled to enrich the semantic features.

At each step, our approach uses pre-trained word embeddings to encode the sentences of the QA pair, and the encoded sentences are used as the input vectors of the model. Based on the joint representation of the QA pair learned by the CNNs, the LSTM is applied for answer sequence learning and makes a prediction for each answer of the question with a softmax function.

3.1 Convolutional Neural Networks for QA Joint Learning

Given a question-answer pair at step $t$, we use convolutional neural networks (CNNs) to learn the joint representation $p_t$ for the pair. Figure 3 illustrates the process of QA joint learning, which includes two stages: summarizing the meaning of the question and of an answer, and generating the joint representation of the QA pair.

Figure 3: CNNs for QA joint learning

To obtain high-level sentence representations of the question and the answer, we use 3 hidden layers in each of the two convolutional sentence models. The output of each hidden layer is made up of a set of 2-dimensional arrays called feature maps, with parameters $(w_m, b_m)$. Each feature map is the outcome of one convolutional or pooling filter. Each pooling layer is followed by an activation function $\sigma$. The output of the $m$-th hidden layer is computed as Eq. 1:

$$H_m = \sigma(\mathrm{pool}(w_m H_{m-1} + b_m)) \tag{1}$$

Here, $H_0$ is a real-valued matrix obtained by the semantic encoding of the sentence, i.e., by concatenating the word vectors within sliding windows. It is the input to the deep convolution and pooling layers, analogous to the image input of a conventional CNN.
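
As a rough illustration of the input encoding and of Eq. 1, the following pure-Python sketch builds $H_0$ by concatenating word vectors inside a sliding window and then applies one hidden layer with a single filter. It is only a sketch under stated assumptions: the function names, the zero padding of short sentences, the choice of max pooling for `pool`, the use of ReLU for $\sigma$ (which the training details state is the activation used), and the toy values are all illustrative and not taken from the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def sliding_window_encode(word_vectors: Matrix, window: int) -> Matrix:
    """Build H_0: each row concatenates `window` consecutive word vectors."""
    if window < 1:
        raise ValueError("window must be >= 1")
    if not word_vectors:
        return []
    dim = len(word_vectors[0])
    padded = list(word_vectors)
    # Assumption: zero-pad short sentences so that one full window exists.
    while len(padded) < window:
        padded.append([0.0] * dim)
    rows = []
    for start in range(len(padded) - window + 1):
        row: List[float] = []
        for vec in padded[start:start + window]:
            row.extend(vec)
        rows.append(row)
    return rows


def conv2d_valid(x: Matrix, kernel: Matrix, bias: float) -> Matrix:
    """Single-filter 2-D convolution without padding: w_m H_{m-1} + b_m."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(x) - kh + 1):
        row = []
        for j in range(len(x[0]) - kw + 1):
            total = bias
            for di in range(kh):
                for dj in range(kw):
                    total += kernel[di][dj] * x[i + di][j + dj]
            row.append(total)
        out.append(row)
    return out


def max_pool2d(x: Matrix, size: int) -> Matrix:
    """Non-overlapping max pooling (assumed form of `pool` in Eq. 1)."""
    out = []
    for i in range(0, len(x) - size + 1, size):
        row = []
        for j in range(0, len(x[0]) - size + 1, size):
            row.append(max(x[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out


def hidden_layer(h_prev: Matrix, kernel: Matrix, bias: float, pool_size: int) -> Matrix:
    """H_m = relu(pool(w_m H_{m-1} + b_m)) for one feature map."""
    pooled = max_pool2d(conv2d_valid(h_prev, kernel, bias), pool_size)
    return [[max(0.0, v) for v in row] for row in pooled]


# Toy 2-dimensional embeddings for a 4-word sentence (made-up values).
sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
h0 = sliding_window_encode(sentence, window=2)  # 3 rows of length 2 * 2
h1 = hidden_layer(h0, kernel=[[1.0, 1.0], [1.0, 1.0]], bias=-10.0, pool_size=2)
print(h0)
print(h1)
```

In the model described here, each layer would hold many such feature maps (100 per layer according to the training details), whereas the sketch shows only one.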

Finally, we combine the two sentence models by adding an additional layer $H_t$ on top. The learned joint representation $p_t$ for the QA pair is formalized as Eq. 2:

$$p_t = \sigma(w_t H_t + b_t) \tag{2}$$

where $\sigma$ is an activation function, and the input vector $H_t$ is constructed by concatenating the sentence representations of the question and the answer.
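
The joint layer of Eq. 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming that the sentence representations have already been flattened into vectors and that the activation is ReLU (as stated in the training details); the function name `joint_representation` and all numeric values are hypothetical.

```python
from typing import List


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def joint_representation(q_vec: List[float], a_vec: List[float],
                         w_t: List[List[float]], b_t: List[float]) -> List[float]:
    """p_t = relu(w_t H_t + b_t), where H_t concatenates q_vec and a_vec."""
    h_t = list(q_vec) + list(a_vec)
    if len(w_t) != len(b_t) or any(len(row) != len(h_t) for row in w_t):
        raise ValueError("shape mismatch between w_t, b_t and the concatenated input")
    return [relu(sum(w * h for w, h in zip(row, h_t)) + b)
            for row, b in zip(w_t, b_t)]


# Made-up sentence representations and weights for illustration only.
q = [1.0, -1.0]
a = [0.5, 2.0]
w = [[1.0, 0.0, 0.0, 1.0],
     [-1.0, 0.0, 0.0, -1.0]]
b = [0.0, 0.5]
p_t = joint_representation(q, a, w, b)
print(p_t)  # [3.0, 0.0]
```

The resulting vector plays the role of $p_t$, the input that the LSTM receives at step $t$.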


3.2 LSTM for Answer Sequence Learning 

Based on the joint representations of the QA pairs, the LSTM unit of our model performs answer sequence learning to model the semantic links between consecutive answers. Unlike the traditional recurrent unit, the LSTM unit modulates the memory at each time step instead of overwriting the states. The key component of the LSTM unit is the memory cell $c_t$, which has a state over time; the LSTM unit decides how to modify and add to the memory in the cell via sigmoidal gates: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The implementation of the LSTM unit in our study is close to the one discussed by Graves (2013). Given the joint representation $p_t$ at time $t$, the memory cell $c_t$ is updated by the input gate's activation $i_t$ and the forget gate's activation $f_t$. The updating equation is given by Eq. 3:

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} p_t + W_{hc} h_{t-1} + b_c) \tag{3}$$


Data          #question   #answer   length
training           2600     16541     6.36
development         300      1645     5.48
test                329      1976     6.00
all                3229     21062     6.00

Table 1: Statistics of experimental dataset


The LSTM unit keeps updating the context by discarding useless context through the forget gate $f_t$ and adding new content through the input gate $i_t$. The extents to which these two gates modulate the context are computed as Eq. 4 and Eq. 5:

$$i_t = \sigma(W_{xi} p_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{4}$$

$$f_t = \sigma(W_{xf} p_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{5}$$

With the updated cell state $c_t$, the final output $h_t$ of the LSTM unit is computed as Eq. 6 and Eq. 7:

$$o_t = \sigma(W_{xo} p_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{6}$$

$$h_t = o_t \tanh(c_t) \tag{7}$$

Note that $(W_*, b_*)$ are the parameters of the LSTM unit, in which $W_{cf}$, $W_{ci}$, and $W_{co}$ are diagonal matrices.
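
To make the update order in Eqs. 3 to 7 concrete, here is a small pure-Python sketch of a single LSTM step with peephole connections. It is an illustration under assumptions rather than the authors' code: the gate activation is taken to be the logistic sigmoid (following the description of the gates as sigmoidal), one hidden size is shared by all gates, the diagonal matrices $W_{ci}$, $W_{cf}$, $W_{co}$ are stored as vectors, and the names `lstm_step` and `zero_params` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def lstm_step(p_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One step of Eqs. 3-7; returns (h_t, c_t)."""

    def pre_activation(gate: str) -> Vector:
        # W_x* p_t + W_h* h_{t-1} + b_* for the given gate suffix.
        from_input = matvec(params["W_x" + gate], p_t)
        from_hidden = matvec(params["W_h" + gate], h_prev)
        return [a + b + bias
                for a, b, bias in zip(from_input, from_hidden, params["b_" + gate])]

    # Eq. 4 and Eq. 5: input and forget gates peek at the previous cell state.
    i_t = [sigmoid(z + w * c)
           for z, w, c in zip(pre_activation("i"), params["W_ci"], c_prev)]
    f_t = [sigmoid(z + w * c)
           for z, w, c in zip(pre_activation("f"), params["W_cf"], c_prev)]
    # Eq. 3: new cell state.
    candidate = [math.tanh(z) for z in pre_activation("c")]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, candidate)]
    # Eq. 6: the output gate peeks at the updated cell state c_t.
    o_t = [sigmoid(z + w * c)
           for z, w, c in zip(pre_activation("o"), params["W_co"], c_t)]
    # Eq. 7: hidden output.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(input_dim: int, hidden_dim: int) -> Dict[str, list]:
    """Hypothetical helper: all-zero parameters, useful only for sanity checks."""
    params: Dict[str, list] = {}
    for gate in ("i", "f", "c", "o"):
        params["W_x" + gate] = [[0.0] * input_dim for _ in range(hidden_dim)]
        params["W_h" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        params["b_" + gate] = [0.0] * hidden_dim
    for gate in ("i", "f", "o"):
        params["W_c" + gate] = [0.0] * hidden_dim  # diagonal stored as a vector
    return params


h_t, c_t = lstm_step([0.3, -0.2], [0.0], [1.0], zero_params(2, 1))
print(h_t, c_t)
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell state is simply halved, which gives an easy sanity check of the update order.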

According to the output $h_t$ at each time step, our approach estimates the conditional probability of the answer sequence over the answer classes, which is given by Eq. 8:

$$P(y_1, \ldots, y_T \mid c, p_1, \ldots, p_{t-1}) = \prod_{t=1}^{T} p(y_t \mid c, y_1, \ldots, y_{t-1}) \tag{8}$$

Here, $(y_1, \ldots, y_T)$ is the label sequence corresponding to the input sequence $(p_1, \ldots, p_{t-1})$, and the class distribution $p(y_t \mid c, y_1, \ldots, y_{t-1})$ is represented by a softmax function.
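
The factorization in Eq. 8 can be illustrated with a short Python sketch that turns per-step class scores into softmax distributions and sums their log-probabilities. The per-step scores are assumed to come from some projection of $h_t$ onto the three classes; that projection, the function names, and the score values are assumptions made for illustration.

```python
import math
from typing import List

LABELS = ["Good", "Bad", "Potential"]


def softmax(scores: List[float]) -> List[float]:
    """Softmax with the usual max-subtraction for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(step_scores: List[List[float]], label_ids: List[int]) -> float:
    """Log of the product in Eq. 8: sum over t of log p(y_t | ...)."""
    if len(step_scores) != len(label_ids):
        raise ValueError("one label is required per time step")
    return sum(math.log(softmax(scores)[y])
               for scores, y in zip(step_scores, label_ids))


# Made-up class scores for a thread with three answers.
scores = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 0.0]]
gold = [LABELS.index("Good"), LABELS.index("Bad"), LABELS.index("Potential")]
predicted = [LABELS[max(range(3), key=lambda k: s[k])] for s in scores]
print(predicted)
print(sequence_log_prob(scores, gold))
```

Working with the sum of log-probabilities instead of the raw product avoids underflow for long answer sequences; the negative of this quantity is a common training loss for such a labeler.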

4 Experiments 

4.1 Experiment Setup 

Experimental Dataset: We conduct experiments on the public dataset of the answer selection challenge in SemEval 2015. This dataset consists of three subsets (training, development, and test sets) and contains 3,229 questions with 21,062 answers. The answers fall into three classes: Good, Bad, and Potential, accounting for 51%, 39%, and 10% respectively. The statistics of the dataset are summarized in Table 1, where #question/#answer denotes the number of questions/answers, and length stands for the average number of answers per question.

Competitor Methods: We compare our approach 
against the following competitor methods: 

SVM (Huang et al., 2007): An SVM-based method with bag-of-words (textual) features, non-textual features, and features based on a topic model (i.e., latent Dirichlet allocation, LDA).

CRF (Ding et al., 2008): A CRF-based method using the same features as the SVM approach.

DBN (Wang et al., 2010): Taking a bag-of-words representation, the method applies deep belief nets to learn the distributed representation of a QA pair, and predicts the class of answers using a logistic regression classifier on the top layer.

mDBN (Hu et al., 2013): In contrast to DBN, multimodal DBN learns the joint representations of textual features and non-textual features rather than bag-of-words.

CNN: Using word embeddings, the CNN-based model of Hu et al. (2014) is used to learn the representations of questions and answers, and a logistic regression classifier is used to predict the class of answers.

Evaluation Metrics: The evaluation metrics include Macro-precision (P), Macro-recall (R), Macro-F1 (F1), and the F1 scores of the individual classes. All hyperparameters are optimized on the training set according to the evaluation results on the development set.

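
For readers who want to reproduce the macro-averaged metrics, the following Python sketch computes them from gold and predicted labels. It assumes that Macro-F1 is the unweighted mean of the per-class F1 scores and that a class with no predictions or no gold instances contributes 0; the function name `macro_scores` and the toy labels are hypothetical.

```python
from typing import List, Sequence, Tuple

LABELS = ["Good", "Bad", "Potential"]


def macro_scores(gold: Sequence[str], pred: Sequence[str],
                 classes: Sequence[str]) -> Tuple[float, float, float]:
    """Return (macro-precision, macro-recall, macro-F1) over `classes`."""
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels for four answers (made-up values).
gold = ["Good", "Good", "Bad", "Potential"]
pred = ["Good", "Bad", "Bad", "Good"]
macro_p, macro_r, macro_f1 = macro_scores(gold, pred, LABELS)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Because every class carries equal weight, a rare class such as Potential influences the macro scores as much as the frequent classes do.
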
Model Architecture and Training Details: The CNNs of our model for QA joint representation learning have 3 hidden layers for modeling the question and the answer sentence respectively, in which each layer has 100 feature maps for the convolution and pooling operators. The window sizes of convolution for the three layers are [1 × 1, 2 × 2, 2 × 2], and the window sizes of pooling are [2 × 2, 2 × 2, 1 × 1]. For the LSTM unit, the size of the input gate is set to 200, and the sizes of the forget gate, output gate, and memory cell are all set to 360.

Stochastic gradient descent (SGD) with back-propagation through time is used to train the model. To prevent serious overfitting, early stopping and dropout (Hinton et al., 2012) are used during the training procedure. The learning rate $\lambda$ is initialized to 0.01 and is updated dynamically according to the gradient descent using the ADADELTA method (Zeiler, 2012). The activation functions ($\sigma$, $\gamma$) in our model adopt the rectified linear unit (ReLU) (Dahl et al., 2013). In addition, the word embeddings for encoding sentences are pre-trained with the unsupervised neural language model (Mikolov et al., 2013) on the Qatar Living data².


Methods       P       R      F1
SVM       50.10   54.43   52.14
CRF       53.89   54.26   53.40
DBN       55.22   53.80   54.07
mDBN      56.11   53.95   54.29
CNN       55.33   54.73   54.42
R-CNN     56.41   56.16   56.14

Table 2: Macro-averaged results (%)

4.2 Results and Analysis 

Table 2 summarizes the Macro-averaged results. The F1 scores of the individual classes are presented in Table 3.

Table 2 shows that the proposed R-CNN approach outperforms the competitor methods on all the Macro-averaged metrics, as expected. The main reason is that R-CNN takes advantage of the semantic correlations between successive answers through the LSTM, in addition to the semantic relationships between question and answer. The joint representation of the QA pair learnt by the CNNs also captures richer matching patterns between question and answer than the other methods.

It is notable that the methods based on deep learning perform better than SVM and CRF, especially for complicated answers (e.g., Potential answers). In contrast, SVM and CRF, which use a large number of features, perform better on the answers that have an obvious tendency (e.g., Good and Bad answers). The main reason is that the distributed representation learnt by a deep learning architecture is able to capture the semantic relationships between question and answer. On the other hand, the feature engineering in both SVM and CRF suffers from the noisy information in CQA and from the feature sparsity problem for short questions and answers.

Compared to DBN and mDBN, CNN and R-CNN show their superiority in modeling QA pairs.

2 http://alt.qcri.org/semeval2015/task3/index.php?id=data-and-tools
























Methods    Good     Bad   Potential
SVM       79.78   76.65        0.00
CRF       79.32   75.50        5.38
DBN       76.99   71.33       13.89
mDBN      77.74   70.39       14.74
CNN       76.45   74.77       12.05
R-CNN     77.31   75.88       15.22

Table 3: F1 scores for the individual classes (%)
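
As a quick arithmetic check of how the two result tables relate, the sketch below averages the per-class F1 scores of Table 3 and compares them with the F1 column of Table 2. It assumes that Macro-F1 is the unweighted mean of the three per-class F1 scores; the numbers are copied from the tables, and the variable names are illustrative.

```python
# Per-class F1 (Good, Bad, Potential) from Table 3.
per_class_f1 = {
    "SVM": (79.78, 76.65, 0.00),
    "CRF": (79.32, 75.50, 5.38),
    "DBN": (76.99, 71.33, 13.89),
    "mDBN": (77.74, 70.39, 14.74),
    "CNN": (76.45, 74.77, 12.05),
    "R-CNN": (77.31, 75.88, 15.22),
}

# Macro-averaged F1 from Table 2.
reported_macro_f1 = {
    "SVM": 52.14,
    "CRF": 53.40,
    "DBN": 54.07,
    "mDBN": 54.29,
    "CNN": 54.42,
    "R-CNN": 56.14,
}

mean_f1 = {method: sum(scores) / len(scores)
           for method, scores in per_class_f1.items()}

for method, value in mean_f1.items():
    gap = abs(value - reported_macro_f1[method])
    print(f"{method:6s} mean of class F1 = {value:.2f}, "
          f"reported = {reported_macro_f1[method]:.2f}, gap = {gap:.3f}")
```

Under that assumption the averages agree with the reported Macro-F1 values up to rounding, which makes the effect of the Potential class on the macro score easy to see: SVM loses a full third of its average because its Potential F1 is 0.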


The convolutional sentence models used in CNN and R-CNN can learn the hierarchical structure of a language object through deep convolution and pooling operators. In addition, both R-CNN and CNN encode the sentence into one tensor, which ensures that the representation contains more semantic features than the bag-of-words representation in DBN and mDBN.


The improvement achieved by R-CNN over CNN demonstrates that answer sequence learning is able to improve the performance of answer selection in CQA. This is because modeling the answer sequence benefits from the shared representation between successive answers, and complements the classification features with the useful context learnt from previous answers. Furthermore, the memory cell and gates in the LSTM unit modulate the valuable context that is passed onwards by updating the state of the RNN during the learning procedure.

The main improvement of R-CNN over the competitor methods comes from the Potential answers, which are far fewer than the other two types of answers. This indicates that R-CNN is able to handle imbalanced data. In fact, the Potential answers are the most difficult to identify among the three types of answers, as Potential is an intermediate category (Marquez et al., 2015). Nevertheless, R-CNN achieves the highest F1 score of 15.22% on Potential answers. In CQA, Q&A archives usually form a multi-party conversation when the asker gives feedback (e.g., "ok" and "please") on users' responses, indicating that the answers to one question are semantically related. Thus, it is easy to understand why R-CNN performs better than the competitor methods, especially on recall: R-CNN can model the semantic correlations between successive answers to learn the context and the long-range dependencies in the answer sequence.


5 Conclusions and Future Work

In this paper, we propose an answer sequence learning model, R-CNN, for the answer selection task by integrating the LSTM unit and CNNs. Based on the recurrent architecture of our model, our approach is able to model the semantic links between successive answers, in addition to the semantic relevance between question and answer. Experimental results demonstrate that our approach can learn useful context from the answer sequence to improve the performance of answer selection in CQA.

In the future, we plan to explore methods for training on imbalanced data to improve the overall performance of our approach. Based on this work, more research can be conducted on topic recognition and semantic role labeling for human-human conversations in the real world.